1. INTRODUCTION

Machine learning (ML) algorithms have parameters that govern their operational efficiency [1]. These
parameters are initialized before training commences. For the past three decades, most machine learning
applications have used a single-layer feedforward network (SLFN). The backbone of SLFN training algorithms is
the backpropagation (BP) method, in which the parameters are updated at each iteration based on the first-order
instantaneous value of the cost function [2]. These parameters have to be tuned to minimize the cost function.
The major challenge is that the parameters require iterative tuning, which slows down the machine learning
algorithms [3]. In batch learning algorithms, both the old and the new data are retrained at every training
round [4]. This consumes much time; hence the quest among researchers for a fast and scalable machine learning
algorithm that alleviates the problem of long training time.

Huang proposed the extreme learning machine (ELM) [5]. The ELM learning principle is essentially a linear
model. ELM randomly assigns the input weights and biases to the hidden neurons, computes the output of the
hidden neurons, and uses the Moore-Penrose generalized inverse to determine the output weights analytically
[6]. Therefore, the input weights and biases no longer require iterative tuning as they did in conventional
learning machines [7], [8]. ELM is faster than backpropagation (BP) algorithms because it is a one-pass
algorithm [9]. Thus, ELM has been exploited by researchers for classification in recent times [10]. However,
the random assignment of its weights and biases makes ELM ill-conditioned, which affects its accuracy and
generalization performance [11]. This problem must be addressed to exploit the fast training advantage of ELM
in the classification of datasets.

Deng et al. [12] adopted regularization parameters to address the problem of random assignment of
input weights and biases in ELM. They based the approach on ridge regression theory and weighted least
squares. Martínez et al. [13] improved on the work of Deng et al. They proposed the use of ridge regression,
elastic net, and lasso methods to prune the hidden neurons in the ELM architecture. They validated their
work on some regression benchmark tasks and showed that it yields a more compact network with competitive
results when compared with ELM. However, the authors of [10] appraised the generalization of these algorithms
but showed that the ridge-regularized ELM requires large memory space and, since large matrix inversion is
involved, has a high computational cost. Therefore, they proposed the generalized regularized ELM (GR-ELM)
approach for multiclass classification tasks. The approach combined the Frobenius norm and the $\ell_{2,1}$
norm of the output weights as the ELM penalty. The R-ELM was maintained for binary classification tasks. They
used the alternating direction method of multipliers (ADMM) for implementation and obtained a more compact
network structure. However, the approach is more complex, and the issue of computational cost remains
unresolved.

Apart from the procedural algorithms above, recent works enhance the performance of ELM classification
with optimization techniques [14], [15]. Optimization techniques select good parameters for the efficient
performance of the ELM algorithm [16]. They generate a Pareto front, from which the update step searches for
an optimum candidate solution. Physical activities such as hill-climbing, migration, and maneuvering
emissions; the biological behavior of beetles, birds, bees, ants, bats, fish, and cats; and other evolutionary
forms have been modeled in optimization to improve ELM [17]-[19].

Eshtay et al. [6] employed a competitive swarm optimization based neural network (CSONN) to control
the complex nature of the ELM network. The algorithm optimized the weights and biases and determined the
number of hidden nodes dynamically. They used 23 benchmark functions for the simulations and compared
the results with the static rule-based ELM and some other meta-heuristic based ELMs. Their approach improved
the generalization of ELM and performed better than the static rule-based ELM. However, the algorithm was
computationally more complex than both the traditional ELM and the static rule-based ELM. Yang and Duan [20]
proposed a hybrid model of artificial bee colony (ABC) and differential evolution (DE) optimization techniques
to improve the parameter selection of ELM. The model improved the generalization performance while retaining
the short processing time offered by ELM. The deficiency of the initial random assignment of input weights and
biases was also mitigated, and the classification results improved. However, the exploitation of ABC is
poor [21] and DE is computationally intensive [22]. Xie et al. [23] proposed a collaborative ELM to prevent
repeated computations that emanate from data redundancy. They employed confidence intervals to enhance the
traditional ELM algorithm. With this approach, they were able to eliminate redundant computations of the
neural network nodes, which improves the efficiency of ELM classification. However, the approach did not
consider the selection of optimal input weights and biases for ELM; therefore, ELM remains liable to getting
stuck in a local minimum.

Many other state-of-the-art optimization techniques have recently been used to improve the performance of
ELM. These include the grey wolf optimizer (GWO), the bat algorithm (BA), bacterial foraging optimization
(BFO) [20], [24]-[26], and many more. These hybrid techniques improved the performance of ELM in some ways;
however, they are prone to getting stuck in local minima, which reduces their classification accuracy.
Moth-flame optimization (MFO) has a better trade-off between exploration and exploitation in the search space
than any of the above schemes, with less computational cost. Therefore, we propose an enhanced MFO-ELM. The
proposed algorithm is set to achieve the following: i) improve the generalization performance of ELM; ii)
introduce a meta-heuristic algorithm to select optimal input weights and biases for ELM; and iii) implement
the proposed MFO-ELM on five machine learning datasets and comparatively evaluate the results. The remaining
sections are organized as follows: section 2 is the proposed method, section 3 is the research method,
section 4 is the results and discussion, and section 5 is the conclusion.


2. THE PROPOSED METHOD
2.1. Extreme learning machine 

ELM is a single-layer feedforward neural network (SLFN). Assume $N$ arbitrary cases, where each
instance has an $n$-dimensional feature vector and belongs to one of $m$ classes. The dataset can be
represented as $(x_i, y_i)$, $i = 1, 2, \ldots, N$, where $x_i \in \mathbb{R}^n$ is the input vector and
$y_i \in \mathbb{R}^m$ is the expected result. For an SLFN with $L$ hidden neurons and activation function
$g(x)$, the network is trained with the input vectors $x_i = [x_{i1}, x_{i2}, \ldots, x_{in}]^T$,
$i = 1, 2, \ldots, N$, and the target vectors $y_i = [y_{i1}, y_{i2}, \ldots, y_{im}]^T$. The output of the
neural network is mathematically modeled in (1).


$$\sum_{j=1}^{L} \beta_j \, g(w_j \cdot x_i + b_j) = y_i, \quad i = 1, 2, \ldots, N$$    (1)


where $w_j = [w_{j1}, w_{j2}, \ldots, w_{jn}]^T$ is the vector of weights connecting the input nodes to the
$j$-th hidden neuron, $b_j$ is the bias of the $j$-th hidden neuron,
$\beta_j = [\beta_{j1}, \beta_{j2}, \ldots, \beta_{jm}]^T$ is the vector of output weights between the $j$-th
hidden neuron and the output nodes, and $w_j \cdot x_i$ is the inner product of $w_j$ and $x_i$. The $N$
equations in (1) can be written compactly as (2).


$$Y = H\beta$$    (2)


where $H$ is the hidden-layer output matrix, $\beta$ is the output weight matrix, and $Y$ is the ELM output.
The operating principle of the extreme learning machine is based on empirical risk minimization. Because the
input weights and biases are randomly assigned, ELM is ill-conditioned and therefore tends to overfit.
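
The training procedure described in this subsection can be sketched in a few lines of NumPy. This is a minimal illustration rather than the implementation used in the study: the sigmoid activation, the function names `elm_train` and `elm_predict`, the sampling ranges, and the synthetic two-class data are all assumptions made for the example.

```python
import numpy as np


def sigmoid(z):
    """Logistic activation g(z), one common choice for ELM hidden neurons."""
    return 1.0 / (1.0 + np.exp(-z))


def elm_train(X, Y, n_hidden, rng):
    """Fit a basic ELM: random hidden layer, analytic output weights."""
    n_features = X.shape[1]
    # Input weights w_j and biases b_j are drawn at random and never tuned.
    W = rng.uniform(-1.0, 1.0, size=(n_features, n_hidden))
    b = rng.uniform(0.0, 1.0, size=n_hidden)
    H = sigmoid(X @ W + b)          # hidden-layer output matrix, shape N x L
    beta = np.linalg.pinv(H) @ Y    # Moore-Penrose solution of H beta = Y
    return W, b, beta


def elm_predict(X, W, b, beta):
    """Evaluate Y = H beta for the given inputs."""
    return sigmoid(X @ W + b) @ beta


# Synthetic two-class problem with one-hot targets (illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
labels = (X[:, 0] + X[:, 1] > 0).astype(int)
Y = np.eye(2)[labels]

W, b, beta = elm_train(X, Y, n_hidden=20, rng=rng)
pred = elm_predict(X, W, b, beta).argmax(axis=1)
train_accuracy = float((pred == labels).mean())
```

Because `W` and `b` are never revisited after sampling, the quality of the fit depends on the random draw, which is the sensitivity that the optimization step in the next subsection targets.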


2.2. Moth-flame optimization technique 

MFO is a population-based bionic optimization algorithm that regulates exploration and exploitation
during the search process. We consider a search space with $n$ moths and $d$ positions. The moth positions are
initialized randomly within the interval $[-1, 1]$ using (3), as in [27]:


$$x_{ij} = r \cdot (ub_j - lb_j) + lb_j$$    (3)


where $x_{ij}$ is the position of the $i$-th moth in the $j$-th dimension of the search space,
$i = 1, 2, \ldots, n$, $j = 1, 2, \ldots, d$; $r$ is a random number in $[0, 1]$; and $lb_j$ and $ub_j$ are the
lower and upper bounds of the moth positions in dimension $j$. Each moth represents a candidate solution, and
the moth positions represent the input weights and biases to be optimized. The position vector of each moth in
the search space is regulated by a flag operator. The flag ensures optimal fitness values. Each position vector
is passed to the fitness function to calculate the fitness value.
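
As a small illustration of the initialization in (3), the sketch below draws a population of moth positions uniformly between the lower and upper bounds. The function name `init_moths` and the dimension of 30 are hypothetical choices for the example; the population size of 100 and the bounds of -1 and 1 follow the values stated in this paper.

```python
import numpy as np


def init_moths(n_moths, dim, lb, ub, rng):
    """Equation (3): x_ij = r * (ub_j - lb_j) + lb_j with r drawn from U[0, 1)."""
    lb = np.broadcast_to(np.asarray(lb, dtype=float), (dim,))
    ub = np.broadcast_to(np.asarray(ub, dtype=float), (dim,))
    r = rng.random((n_moths, dim))   # one independent random number per entry
    return r * (ub - lb) + lb


rng = np.random.default_rng(42)
moths = init_moths(n_moths=100, dim=30, lb=-1.0, ub=1.0, rng=rng)
```

Each row of `moths` is one candidate solution, that is, one flattened set of input weights and biases.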

The moths and the flames are both solutions in the search space of the algorithm. The moths are the
particles that move around the flames, whereas the flames form a matrix of the best moth positions attained so
far. Each flame is assigned to a moth to prevent stagnation in a local optimum. As the moths search around the
flames, the flames are updated whenever a better solution is found. The process continues until the maximum
number of iterations is reached, so that all the moths attain their best possible solutions. A logarithmic
spiral model [22] is used as the main update mechanism of the moths. The logarithmic spiral for the MFO
algorithm is given in (4).


$$S(M_i, F_j) = D_i \cdot e^{bt} \cdot \cos(2\pi t) + F_j$$    (4)


where $S$ is the spiral function that controls the flight of a moth around a flame; the resulting position need
not lie in the space between them, which is how the model regulates exploration and exploitation. $M_i$
indicates the $i$-th moth, $F_j$ indicates the $j$-th flame, $D_i = |F_j - M_i|$ is the distance of the $i$-th
moth from the $j$-th flame, $b$ is a constant that defines the shape of the logarithmic spiral, and $t$ is a
random number in $[-1, 1]$ that specifies how close the next moth position should be to the flame. The term
$2\pi t$ sets the spacing between successive turns of the spiral. The parameter $a$ decreases linearly from -1
to -2 over the iterations and determines the convergence of the algorithm. The decrement in the number of
flames also helps to regulate the exploration and exploitation of the search space.
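
The spiral update can be illustrated with the following sketch. The function name `spiral_update` is hypothetical, the shape constant is called `b` and the linearly decreasing convergence parameter `a` purely as naming choices for this example, and drawing `t` from the interval [a, 1] (which equals [-1, 1] when a = -1) is an assumption borrowed from the common formulation of MFO rather than something stated here.

```python
import numpy as np


def spiral_update(moth, flame, a, b=1.0, rng=None):
    """Move a moth around a flame: D * exp(b t) * cos(2 pi t) + F."""
    rng = np.random.default_rng() if rng is None else rng
    distance = np.abs(flame - moth)   # D, taken per dimension
    # Assumption: t is uniform on [a, 1]; a = -1 gives the interval [-1, 1].
    t = (a - 1.0) * rng.random(moth.shape) + 1.0
    return distance * np.exp(b * t) * np.cos(2.0 * np.pi * t) + flame


rng = np.random.default_rng(0)
moth = np.array([0.8, -0.3, 0.1])
flame = np.array([0.5, -0.2, 0.0])
new_position = spiral_update(moth, flame, a=-1.0, rng=rng)
```

Since `t` never exceeds 1 and the cosine is bounded by 1, the new position stays within `distance * e**b` of the flame in every dimension, and a moth that already sits on its flame does not move.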

The effectiveness of the algorithm strongly depends on the distance $D$ between the moths and the
flames. Updating the moth positions with respect to $n$ different flame locations may degrade the exploitation
of the candidate solutions. Hence, the adaptive model shown in (5) is employed to determine the number of
flames.


$$\text{flameNumber} = \text{round}\left(N - l \cdot \frac{N - 1}{T}\right)$$    (5)


where $l$ is the current iteration, $N$ is the flame count, and $T$ is the maximum number of iterations. In the
last iteration, the moths update their positions with respect to the best flame. After termination, the best
moth is returned as the best approximation of the optimum.
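
The flame schedule in (5) is easy to check numerically. The sketch below is illustrative: the function name `flame_number` and the choice of 100 iterations are assumptions, while the flame count of 100 matches the population size used in this study.

```python
def flame_number(iteration, n_flames, max_iter):
    """Equation (5): round(N - l * (N - 1) / T)."""
    # Note: Python's round() sends exact halves to the nearest even integer.
    return round(n_flames - iteration * (n_flames - 1) / max_iter)


# Flame count at the start, after 10 iterations, and at the final iteration.
schedule = [flame_number(l, n_flames=100, max_iter=100) for l in (0, 10, 100)]
```

Here `schedule` evaluates to `[100, 90, 1]`: the search starts with one flame per moth and ends with a single flame, so that late iterations concentrate on the best solution found.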
3. METHOD

To improve the classification accuracy of ELM in this study, we used (3) to generate a population of 100
candidate solutions. Then, we employed MFO to optimize the population. The best candidate solution was
selected and reshaped into weights and biases. The weights are random numbers in the range [-1, 1], while the
biases are in the range [0, 1]. These parameters are passed to ELM for classification. We present the
conceptual design in Figure 1 and the algorithm in Algorithm 1; the optimization parameters are taken from
[28]-[30].

We ran the algorithm on five real-life datasets. Four of the datasets (Blood, Breast, Diabetes, and Liver)
were drawn from the UCI repository, while the Phoneme dataset was drawn from the datahub repository. The
datasets were normalized to ensure an even distribution of data points and to avoid skew towards features with
higher values. The datasets were partitioned into training and testing data in the ratio 2:1, as shown in the
data pre-processing phase of Figure 1. We constructed ten SLFNs at intervals of 5 hidden nodes (ranging
between 5 and 50 nodes). Each simulation reached its best accuracy within this node range. For every SLFN,
thirty trials of simulation were run, and the average results were computed. The results of the particle swarm
optimization extreme learning machine (PSO-ELM) and competitive swarm optimization extreme learning machine
(CSO-ELM) algorithms are drawn from [31] for performance comparison.

Figure 1. Flowchart of the enhanced MFO-ELM scheme


Algorithm 1. Enhanced MFO-ELM algorithm

Initialize MFO parameters
Output: Accuracy
Generate the initial guess using equation (3) and evaluate the fitness
while iter < iterM do
    Update the flame number using equation (5)
    Sort the moths by fitness value and select the best
    Calculate the linear decrement of a from -1 to -2
    D = abs(F_ij - X_ij)    // D is the distance between the flame and the corresponding moth
    for i = 1 to N do
        Update the moth positions with respect to the flames using equation (4)
    end for
    Select the best flame
    iter = iter + 1
end while
Reshape the best flame (F) into weights and biases (w, b)
Calculate the hidden neuron output H
Calculate the output weight β
Calculate the ELM output with equation (2)
Calculate the misclassification and the accuracy
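
The steps of Algorithm 1 can be sketched end to end as follows. This is a compact illustration under several assumptions that are not taken from the study: the fitness is the training accuracy of the resulting ELM, a sigmoid activation is used, all parameters share the single bound [-1, 1], surplus moths share the last flame, and the names `elm_accuracy` and `mfo_elm`, the small population, and the synthetic data are hypothetical.

```python
import numpy as np


def elm_accuracy(params, X, labels, n_hidden, n_classes):
    """Fitness: training accuracy of an ELM built from a flat parameter vector."""
    d = X.shape[1]
    W = params[: d * n_hidden].reshape(d, n_hidden)   # input weights
    b = params[d * n_hidden:]                         # hidden biases
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))            # hidden-layer output
    beta = np.linalg.pinv(H) @ np.eye(n_classes)[labels]
    return float(((H @ beta).argmax(axis=1) == labels).mean())


def mfo_elm(X, labels, n_hidden=10, n_moths=20, max_iter=30, b=1.0, seed=0):
    rng = np.random.default_rng(seed)
    n_classes = int(labels.max()) + 1
    dim = X.shape[1] * n_hidden + n_hidden
    fitness = lambda p: elm_accuracy(p, X, labels, n_hidden, n_classes)

    moths = rng.uniform(-1.0, 1.0, size=(n_moths, dim))           # equation (3)
    moth_fit = np.array([fitness(m) for m in moths])
    order = np.argsort(-moth_fit)
    flames, flame_fit = moths[order].copy(), moth_fit[order].copy()

    for it in range(1, max_iter + 1):
        n_flames = round(n_moths - it * (n_moths - 1) / max_iter)  # equation (5)
        a = -1.0 - it / max_iter               # decreases linearly from -1 to -2
        for i in range(n_moths):
            j = min(i, n_flames - 1)           # surplus moths share the last flame
            t = (a - 1.0) * rng.random(dim) + 1.0
            D = np.abs(flames[j] - moths[i])
            moths[i] = D * np.exp(b * t) * np.cos(2 * np.pi * t) + flames[j]  # (4)
        moths = np.clip(moths, -1.0, 1.0)      # keep positions inside the bounds
        moth_fit = np.array([fitness(m) for m in moths])
        # The new flames are the best n_moths positions seen so far.
        pool = np.vstack([flames, moths])
        pool_fit = np.concatenate([flame_fit, moth_fit])
        order = np.argsort(-pool_fit)[:n_moths]
        flames, flame_fit = pool[order].copy(), pool_fit[order].copy()
    return flames[0], float(flame_fit[0])


# Synthetic two-class data (illustration only).
rng = np.random.default_rng(1)
X = rng.normal(size=(80, 3))
labels = (X[:, 0] * X[:, 1] > 0).astype(int)
best_params, best_fit = mfo_elm(X, labels)
```

Because the flames always retain the best positions found so far, the returned fitness can never be worse than that of the best initial random candidate. In practice the fitness would be measured on held-out data rather than on the training set.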
4. RESULTS AND DISCUSSION

In this section, we compare the accuracy of the proposed MFO-ELM with CSO-ELM, PSO-ELM, and ELM for each
dataset, as shown in Figure 2. The comparative performance of each model on each dataset is summarized in
Figure 3. We also discuss the rate of improvement of the proposed algorithm over these algorithms and test
the significance of its improvement over the classical ELM.


4.1. Evaluation of accuracy 

We evaluated the classification performance of MFO-ELM against CSO-ELM, PSO-ELM, and ELM for each
dataset in Figures 2(a)-(e). We present the relative performance of the algorithms in Figure 3. The relative
performance is the average classification accuracy of each algorithm in the simulations, presented in a
single chart for easy comparison. The results show that the proposed MFO-ELM performed best on the Blood,
Breast, Diabetes, and Liver datasets, which represent 80% of the simulations. Only on the Phoneme dataset did
CSO-ELM and PSO-ELM perform better than MFO-ELM.

Figure 2. Comparison of the accuracy of MFO-ELM: (a) performance comparison of classification
on blood dataset, (b) performance comparison of classification on breast dataset, (c) performance comparison
of classification on diabetes dataset, (d) performance comparison of classification on liver dataset,
and (e) performance comparison of classification on phoneme dataset

The overall evaluation of the comparison shows that enhancing ELM with MFO improves its performance.
The mean accuracy of ELM is improved in all the simulations on the five datasets; that is, the accuracy of
MFO-ELM is better than that of ELM in every case. Therefore, optimizing the input weights and biases with MFO
improved the accuracy of the ELM algorithm.

Figure 3. Relative classification accuracy on each dataset


4.2. Percentage performance improvement rate (PIR%) 

Furthermore, the percentage performance improvement rate (PIR%) evaluates the improvement of MFO-ELM
over ELM and compares it with that of the PSO-ELM and CSO-ELM algorithms. The higher the PIR%, the better the
improvement over the ELM algorithm. PIR% is expressed in (6):


$$\text{PIR\%} = \frac{A_s - B_s}{B_s} \times 100$$    (6)


where $A_s$ and $B_s$ are the scores of the two algorithms being compared: $A_s$ is the score of the proposed
algorithm and $B_s$ is the score of the benchmark algorithm.
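
As a quick worked check, the sketch below assumes that (6) takes the usual relative-improvement form, (A_s - B_s) / B_s x 100, which reproduces the PIR% values reported in Table 1. The function name `pir_percent` is a hypothetical choice; the two accuracies are the Liver results for MFO-ELM and ELM.

```python
def pir_percent(proposed, benchmark):
    """Percentage improvement of `proposed` over `benchmark`, as in equation (6)."""
    return (proposed - benchmark) / benchmark * 100.0


# Liver dataset accuracies from Table 1: ELM 71.92, MFO-ELM 76.29.
liver_pir = pir_percent(76.29, 71.92)
```

Rounded to four decimal places, `liver_pir` is 6.0762, matching the entry for MFO-ELM on the Liver dataset. A negative value means the compared algorithm scored below the benchmark.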

Table 1 shows the PIR% of the classification accuracy of the MFO-ELM scheme and the other two
algorithms (PSO-ELM and CSO-ELM) over ELM for the five selected datasets. On the Blood dataset, the PIR%
gained by MFO-ELM over ELM is 1.0014 and that of PSO-ELM is 0.4381, while CSO-ELM scaled negatively with
-0.2754. This shows that MFO enhanced the performance of ELM better than the other two meta-heuristic
algorithms. On the Breast cancer dataset, the PIR% of MFO-ELM over ELM is 0.4914, while PSO-ELM and CSO-ELM
had negative scores of -0.3890 and -0.4197 respectively. Therefore, MFO-ELM improved on ELM, whereas the other
meta-heuristic algorithms did not.


Table 1. Summary of PIR% of each meta-heuristic algorithm on ELM 


Dataset     ELM         PSO-ELM               CSO-ELM               MFO-ELM
            Accuracy    Accuracy    PIR%      Accuracy    PIR%      Accuracy    PIR%
Blood       79.89       80.24       0.4381    76.67       -0.2754   80.69       1.0014
Breast      97.68       97.30       -0.3890   97.27       -0.4197   98.16       0.4914
Diabetes    77.87       76.48       -1.7850   77.15       -0.9246   78.52       0.8647
Liver       71.92       70.03       -2.6279   73.22       1.8076    76.29       6.0762
Phoneme     82.42       83.49       1.2982    83.46       1.2618    82.95       0.6431


Similarly, on the Diabetes dataset, the PIR% of MFO-ELM over ELM is 0.8647, while the rates for PSO-ELM
and CSO-ELM are negative, at -1.7850 and -0.9246 respectively. MFO-ELM again proves superior to the other
algorithms. On the Liver dataset, CSO-ELM showed a better improvement rate than PSO-ELM, with scores of
1.8076 and -2.6279 respectively; however, MFO-ELM has a substantially higher improvement rate of 6.0762.
Contrary to the results on the previous datasets, the improvement rate of MFO-ELM on the Phoneme dataset is
lower than those of PSO-ELM and CSO-ELM. While MFO-ELM has an improvement rate of 0.6431, PSO-ELM and CSO-ELM
have improvement rates of 1.2982 and 1.2618 respectively. PSO-ELM shows a higher improvement rate than any
other algorithm on this dataset. This is consistent with the no-free-lunch theorem [32].

The accuracy and PIR% results show that the enhanced MFO-ELM improves the accuracy of ELM classification
more than the other meta-heuristic optimization algorithms on most of the datasets. This is because MFO
explores and exploits the search space better than the other two meta-heuristic algorithms, which eventually
leads to the selection of optimal parameters for ELM.


4.3. Test of significant improvement of MFO-ELM on ELM classification 

We measured the significance of the improvement of the MFO-ELM algorithm over the ELM algorithm with
the Wilcoxon signed-rank test [31], [33]. Table 2 presents the analysis of the test. Mean is the average
classification accuracy of each of the two algorithms, Std Dev is the standard deviation, and Min and Max are
the respective minimum and maximum accuracies. The Wilcoxon signed-rank test columns show the rank
(negative (Neg.) or positive (Pos.)), the number of samples N, the mean rank M, and the two-tailed p-value.
Rank shows the ranking of the algorithms based on the key below the table. The positive rank is the number of
simulations in which the accuracy of MFO-ELM is better than that of ELM, the negative rank is the number in
which the accuracy of ELM is better than that of MFO-ELM, and Ties is the number of simulations in which the
two algorithms have equal accuracy. The better result of the two algorithms for each statistical and Wilcoxon
rank test measure is shown in bold in the table.
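
To make the rank columns concrete, the sketch below computes the counts and mean ranks of positive and negative differences for paired accuracies. It is an illustration only: the function name `signed_rank_summary` and the ten accuracy pairs are hypothetical and are not the simulation results of this study. The p-value is not computed here; it would normally come from a statistics library.

```python
def signed_rank_summary(proposed, baseline):
    """Counts and mean ranks of positive and negative paired differences."""
    diffs = [p - q for p, q in zip(proposed, baseline)]
    nonzero = sorted((d for d in diffs if d != 0), key=abs)
    # Rank the absolute differences, averaging ranks over equal values.
    ranks = []
    i = 0
    while i < len(nonzero):
        j = i
        while j < len(nonzero) and abs(nonzero[j]) == abs(nonzero[i]):
            j += 1
        ranks.extend([(i + 1 + j) / 2.0] * (j - i))
        i = j
    pos = [r for d, r in zip(nonzero, ranks) if d > 0]
    neg = [r for d, r in zip(nonzero, ranks) if d < 0]
    mean = lambda xs: sum(xs) / len(xs) if xs else 0.0
    return {
        "pos_n": len(pos), "pos_mean_rank": mean(pos),
        "neg_n": len(neg), "neg_mean_rank": mean(neg),
        "ties": len(diffs) - len(nonzero),
    }


# Hypothetical accuracies for ten network sizes; the proposed method wins every pair.
baseline = [70] * 10
proposed = [71, 72, 73, 74, 75, 76, 77, 78, 79, 80]
summary = signed_rank_summary(proposed, baseline)
```

When one algorithm wins all ten comparisons, the ten ranks 1 to 10 all fall on one side, giving a count of 10 and a mean rank of 5.5 for that side and zeros for the other.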

For the Blood dataset, MFO-ELM ranked higher than the ELM algorithm in all the statistical and Wilcoxon
rank test measures. We obtained similar results for the Breast and Phoneme dataset simulations. Also, the
p-values of the simulations on these three datasets are less than 0.05, which shows a significant improvement
of MFO-ELM over the standard ELM. The statistical measures for MFO-ELM are better than those of the ELM
algorithm in the Diabetes dataset simulations; however, the Wilcoxon signed-rank test shows that the
improvement is not significant. For the Liver dataset, ELM has a better standard deviation than MFO-ELM in the
statistics columns, the Wilcoxon signed-rank test favours ELM over MFO-ELM in six of the ten comparisons, and
the p-value shows no significant improvement.

The overall evaluation of the comparison shows that MFO improves the ELM performance.
Statistically, the mean accuracy is improved on all five datasets; that is, the mean accuracy of MFO-ELM is
better than that of ELM in every case. Also, the standard deviations of MFO-ELM are lower than those of ELM in
four of the five datasets, which means the algorithm is more stable in 80% of the simulations. Furthermore, in
the Wilcoxon signed-rank test, MFO-ELM ranked higher than ELM on three datasets, equal on one, and lower on
only one. Counting the tie as half, this gives a 70% improvement ranking. The p-values also show a
significant improvement in 60% of the simulations. Therefore, optimizing the input weights and biases with
MFO improved the stability and accuracy of the ELM algorithm, significantly so on three of the five datasets.


Table 2. Wilcoxon signed rank test for MFO-ELM and ELM classification accuracy 


                    Statistics                          Wilcoxon signed rank test (MFO-ELM vs. ELM)
Dataset   Algorithm Mean     Std Dev  Min      Max      Rank   N    M      p-value
Blood     ELM       78.810   0.007    77.380   79.890   Neg.   0    0.00   0.005
          MFO-ELM   79.940   0.003    79.630   80.690   Pos.   10   5.50
                                                        Ties   0
Breast    ELM       96.860   0.013    93.150   97.650   Neg.   2    2.50   0.022
          MFO-ELM   97.560   0.003    97.210   98.160   Pos.   8    6.25
                                                        Ties   0
Diabetes  ELM       75.880   0.015    72.170   77.870   Neg.   5    5.80   0.878
          MFO-ELM   76.000   0.014    74.340   78.520   Pos.   5    5.20
                                                        Ties   0
Liver     ELM       69.580   0.026    63.010   71.920   Neg.   6    3.83   0.646
          MFO-ELM   70.030   0.043    65.420   76.290   Pos.   4    8.00
                                                        Ties   0
Phoneme   ELM       78.810   0.026    74.100   82.420   Neg.   0    0.00   0.005
          MFO-ELM   82.090   0.014    78.710   82.950   Pos.   10   5.50
                                                        Ties   0

Note: The bold values indicate the best results
Neg. Ranks: MFO-ELM < ELM
Pos. Ranks: MFO-ELM > ELM
Ties: MFO-ELM = ELM


5. CONCLUSION 

This study proposed an enhanced MFO-ELM algorithm. The proposed algorithm used MFO to set the
initial values of the input weights and hidden neuron biases for the ELM classifier. It was applied to the
classification of some medical datasets. The overall performance shows that the proposed MFO-ELM
algorithm improved on the basic ELM algorithm on all five datasets; that is, MFO-ELM improved the
performance of ELM in 100% of the simulations. This shows that the initial setting of the input weights and
biases of ELM by the MFO optimization scheme enhances the performance of the ELM classifier. Furthermore,
when MFO-ELM is compared with the other ELM-enhancing optimization algorithms, it is only on the Phoneme
dataset that PSO-ELM and CSO-ELM performed better than the proposed algorithm. MFO-ELM is superior to the
other meta-heuristic algorithms in 80% of the simulations. Further study will focus on the hybridization of
two meta-heuristic algorithms to improve the parameter setting of ELM.